Nowadays, with the improvement of the technologies there are an amount of data that allows us to understand if there is any problem or not in the healthy of a person.
Here in this project we explain and we try to predict whether a
patient should be diagnosed with Diabetic or not. In this project we
will apply a Fully Bayesian Logistic Regression model aimed at
predicting the onset of diabetes mellitus in Pima Indians given
diagnostic measurements.
Diabetes is a metabolic disorders characterized by a high blood sugar
level (hyperglycemia) over a prolonged period of time. It is mainly due
to the reduction of insulin production or lack of capability by the
cells to properly respond to the insulin produced.
Kaggle, is the main platform where I found this interesting dataset where applying our main bayesian inference as the main scope of this project.
This dataset is composed by these features: The dataset used is Pima Indians Diabetes Dataset, publicly available on Kaggle : https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, containing medical information of female patients belonging to Pima Indian heritage of age \(\geq\) 21.
Analyzing these features, we captured some interesting information for each feature within the dataset:
Pregnancies: Number of Times Pregnant (Int)
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
BloodPressure: Diastolic blood pressure (mm Hg):
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2):
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Patient Age (Years)
For each patient, the binary target variable Outcome indicates whether or not she is affected by Diabetes.
The main features detected have these summaries:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| Min. | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0780 | 21.0000 |
| 1st Qu. | 1.0000 | 99.0000 | 62.0000 | 0.0000 | 0.0000 | 27.3000 | 0.2438 | 24.0000 |
| Median | 3.0000 | 117.0000 | 72.0000 | 23.0000 | 30.5000 | 32.0000 | 0.3725 | 29.0000 |
| Mean | 3.8451 | 120.8945 | 69.1055 | 20.5365 | 79.7995 | 31.9926 | 0.4719 | 33.2409 |
| 3rd Qu. | 6.0000 | 140.2500 | 80.0000 | 32.0000 | 127.2500 | 36.6000 | 0.6262 | 41.0000 |
| Max. | 17.0000 | 199.0000 | 122.0000 | 99.0000 | 846.0000 | 67.1000 | 2.4200 | 81.0000 |
| Missing Values | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Zeros | 111.0000 | 5.0000 | 35.0000 | 227.0000 | 374.0000 | 11.0000 | 0.0000 | 0.0000 |
summary(data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
let’s how variables are distributed:
# How variables are distributed
library(reshape2)
library(ggplot2)
gg <- melt(data)
ggplot(gg, aes(x=value, fill=variable)) +
geom_histogram(binwidth=5)+
facet_wrap(~variable)
There are no missing data, but several meaningless zero values, that
we can consider as NaN (e.g., BloodPressure = 0). So, we
replaced the zeros with the per-class median in the following
fields:
##
## Call:
## lm(formula = Outcome ~ ., data = dataBP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01348 -0.29513 -0.09541 0.32112 1.24160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8538943 0.0854850 -9.989 < 2e-16 ***
## Pregnancies 0.0205919 0.0051300 4.014 6.56e-05 ***
## Glucose 0.0059203 0.0005151 11.493 < 2e-16 ***
## BloodPressure -0.0023319 0.0008116 -2.873 0.00418 **
## SkinThickness 0.0001545 0.0011122 0.139 0.88954
## Insulin -0.0001805 0.0001498 -1.205 0.22857
## BMI 0.0132440 0.0020878 6.344 3.85e-10 ***
## DiabetesPedigreeFunction 0.1472374 0.0450539 3.268 0.00113 **
## Age 0.0026214 0.0015486 1.693 0.09092 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4002 on 759 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.2959
## F-statistic: 41.29 on 8 and 759 DF, p-value: < 2.2e-16
Predictors with p-values greater than 0.05 might be considered not significant and could potentially be removed from the model.In this case, SkinThickness and Insulin have a p-value greater than 0.05, suggesting that they might not be statistically significant predictors of the outcome. So, we removed them.
\[~\] Here it is represented the Box Plot for all the features
\[~\] In the following, representing the distribution of the different features. Please notice that the values reported refer to intra-class percentages, i.e., the counts are normalized by dividing for the carnality of the class they refer to. By doing so, we are able to compare the different categories regardless from the class unbalance of the data set.
\[~\]
Check what is the impact of age over the Outcome
ggplot(data,aes(x=Age,fill=factor(Outcome)))+geom_density(alpha=0.4)+scale_fill_manual(values=c("blue", "red"))+labs(title="Distribution of Age")
In this dataset, diabetes seems more common across people above age of
35; this is possibly because that Type I diabetes can arise at any age,
while Type II diabetes is more common across people of age \(\geq 40\).
At young age Diabetes pedigree function was high, as age goes on its get reduced.
ggplot(data,aes(x=cut(Age,breaks=5),y=DiabetesPedigreeFunction,fill=cut(Age,breaks=5)))+geom_boxplot()+scale_fill_brewer(palette="RdBu")
\[~\]
Not surprisingly, diabetic people have an higher glycemic index, being this is one of the hillness’ symptoms.
Check the relationship between Glucose and BP
# Imports
dataBP <- read.csv("C:/Users/user/OneDrive/Документы/diabetes.csv", header = TRUE)
ggplot(dataBP,aes(x=Glucose,y=BloodPressure,size=Age,color=Outcome))+geom_jitter(alpha=0.6)+scale_color_gradient(low = 'red', high = 'blue')+labs(title="BP and Glucose level against Age")
\[~\]
Diabetes is more common among obese people; indeed obesity is one of the risk factor for developing Type II diabetes.
col = list('Diabetic' = 'blue', 'Healthy'= 'red')
ggplot(data,aes(x=BMI,fill=factor(Outcome)))+stat_density(alpha=0.6)+scale_fill_manual(values=c("blue", "red"))+labs(title="Distribution of BMI")+theme(legend.position="bottom",title=element_text("Outcome"))
###Pregnancy \[~\] Number of
Pregnancies has an impact over diabetes outcome
#Number of Pregnancies has an impact over diabetes outcome
ggplot(data,aes(x=Pregnancies,fill=factor(Outcome)))+geom_bar(position="Dodge")+scale_fill_manual(values=c("blue","red"))+scale_x_continuous(limits=c(0,16))+labs(title="Pregnancies Vs Outcome")
Relationship between Glucose and Diabetes Pedigree Function vs Pregnancies
#Relationship between Glucose and Diabetes Pedigree Function vs Pregnancies
ggplot(dataBP,aes(x=Glucose,y=DiabetesPedigreeFunction,color=Pregnancies))+geom_point()+scale_color_gradient(low = 'red', high = 'blue')
\[~\]
Diabetic people seem to suffer more from hypertension than non-diabetic patients. This may be due to the fact that some forms of diabetes (Type II) are associated with high weight, which on turn may cause high blood pressure.
\[~\]
Finding the correlation between the attributes
# Finding the correlation between the attributs
library(ggcorrplot)
corr<-round(cor(dataBP),1)
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("red", "white", "blue"),
title="Correlogram of Diabetes data",
ggtheme=theme_bw)
The features in our dataset are weakly correlated, hence in the
regression, there won’t be Multicollinearity problems.Age and number of
Pregancies are weakly Diabeticly correlated (\(\rho \approx 0.5\) ). This is not
surprising since elder women are more likely to have had an higher
number of Pregnancies during their life.
ggplot(dataBP,aes(x=Insulin,y=Glucose))+geom_point(aes(color=Outcome))+geom_smooth()+scale_color_gradient(low = 'red', high = 'blue')
\[~\]
In order to overcome the severe class imbalance of our data, we applied SMOTE: a data augmentation strategy consisting in oversampling the minority class (Diabetic in our case).
Eventually we obtained a dataset with the same statistics as above, but more balanced!
\[~\] - Likelihood \(\pi(y_{obs}|\theta)\): measures the goodness of fit of a statistical model to a sample of data for giving values of the unknown parameters. It is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at the observed values.
Logistic regression model: Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. This is more faster than other binary classifiers.
logit link function: it uses the Cumulative Distribution Function (CDF) of the logistic distribution. A benefit to use the Logit is that the coefficients can be interpreted in the terms of odds ratios.
\[~\] We will use Bayesian Logistic Regression to model the relationship between the binary target variable \(Y \in \{0,1\}\) (diabetic = 1, healthy = 0) with the input features \(x_j\ \ ,\ \ \ j=1,...,p\):
| feature name | variable name |
|---|---|
| Pregnancies | x1 |
| Glucose | x2 |
| BloodPressure | x3 |
| BMI | x4 |
| DiabetesPedigreeFunction | x5 |
| Age | x6 |
| Outcome | Y |
Moreover, in the following we will consider the canonical notation: \(x_0=1\) in order to model the intercept. Differently from Bayesian linear regression, there is no variance term to be estimated, and only the regression parameters (\(\beta_j\)) will be estimated.
For each patient, the target variable can be modeled as a Bernoulli: \[ Y|\boldsymbol{x} \sim Ber(\theta_{\boldsymbol{x}})\] Hence: \[ \Pr(Y = 1|\boldsymbol{x}) = \mathbb{E}[Y = 1|\boldsymbol{x}] = \theta_\boldsymbol{x} \] In logistic regression, the Link function that links \(\mathbb{E}[Y = 1|\boldsymbol{x}]\) with the linear predictor \(\boldsymbol{\beta \cdot x} = \sum_{j} \beta_j \cdot x_j\) is the Logit Function:
\[ \theta_{\boldsymbol{x}} = \text{ilogit}\left(\boldsymbol{\beta \cdot x} \right) \iff \boldsymbol{\beta \cdot x} = \text{logit} \left(\theta_{x}\right) \]
where: \[ \text{ilogit}\left(z\right) = \frac{\exp(z)}{1+\exp(z)} \] is the sigmoid function, forcing \(\boldsymbol{\beta \cdot x_i}\) in the \([0,1]\) range;
and \[\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\] Is the logit function (inverse of the sigmoid) computing the logarithm of the odds.
The overall dataset is randomly shuffled and sliced in Train and Test sets according to a 80/20 split.
Let’s say we have \(n\) observations; we can define their likelihood function as follows: \[ L_{\boldsymbol{y}}(\boldsymbol{\beta}, \boldsymbol{X}) = f(\boldsymbol{y}|\boldsymbol{\beta},\boldsymbol{X}) = \prod_{i= 1}^n f(y^{(i)}|\boldsymbol{\beta,x^{(i)}}) = \prod_{i= 1}^n \text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}}) ^{y^{(i)}}\left(1-\text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}})\right)^{1-y^{(i)}} \]
\[~\] The joint prior distribution of the parameters on \(\mathbf{B}^p = \{(-\infty, +\infty)\}^{p}\) can be computed as the product of their marginal distributions:
\[ \pi(\boldsymbol{\beta}) = \prod_{j= 1}^{p} \pi(\beta_j) = \pi(\beta_0)\cdot \pi(\beta_1)\cdot ...\cdot \pi(\beta_p) \]
In our case we will consider two kind of priors, expressing our belief that each \(\beta_j\) represents the true population characteristics (under our modellistic assumptions), prior to the observation of any data :
\[ \boldsymbol{\beta} \sim N_p(\boldsymbol{\mu}, \mathbb{I}_p\sigma^2) \iff\beta_j \overset{iid}{\sim} N\left(\mu_j, \sigma_j^2= 10^4\right) \ \ \ j = 1,..., p \]
\[ \beta_j \overset{iid}{\sim} \text{Laplace}\left(\mu_j,b_j = \frac{10^2}{\sqrt2}\right) \ \ \ j = 1,..., p \] For what concernes the hyperparameters, we consider both Normal and Laplace distribution centered in 0 (\(\mu_j = 0 \ , \ \ j = 1,..., p\)) Since we have weak prior knowledge, a diffuse prior distribution is used, spreading more or less evenly the probability distribution over \(\beta_j\), which is modeled in terms of assigning an high variance to the prior:
We will then compare the Deviance Information Criterion of the two models to select the most fitting one.
\[~\]
\[~\] model { for (i in 1:N){ Y[i] ~ dbern(p[i]) p[i] <- ilogit(beta0+beta1x1[i] + beta2x2[i] + beta3x3[i] + beta4x4[i] + beta5x5[i] + beta6x6[i])
}
# Defining the prior beta parameters
beta0 ~ ddexp(0, 1.0E-4)
beta1 ~ ddexp(0, 1.0E-4)
beta2 ~ ddexp(0, 1.0E-4)
beta3 ~ ddexp(0, 1.0E-4)
beta4 ~ ddexp(0, 1.0E-4)
beta5 ~ ddexp(0, 1.0E-4)
beta6 ~ ddexp(0, 1.0E-4)
}
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.6472 | 0.7593 | -10.1110 | -9.1532 | -8.6436 | -8.1331 | -7.2100 | 1.0005 | 3000 |
| beta1 | 0.1168 | 0.0316 | 0.0581 | 0.0951 | 0.1164 | 0.1389 | 0.1787 | 1.0005 | 3000 |
| beta2 | 0.0354 | 0.0034 | 0.0289 | 0.0330 | 0.0354 | 0.0376 | 0.0423 | 1.0008 | 3000 |
| beta3 | -0.0062 | 0.0081 | -0.0221 | -0.0117 | -0.0063 | -0.0007 | 0.0101 | 1.0008 | 3000 |
| beta4 | 0.1027 | 0.0150 | 0.0738 | 0.0924 | 0.1024 | 0.1128 | 0.1325 | 1.0007 | 3000 |
| beta5 | 0.8973 | 0.2855 | 0.3435 | 0.7077 | 0.8944 | 1.0909 | 1.4522 | 1.0019 | 1400 |
| beta6 | 0.0130 | 0.0094 | -0.0048 | 0.0067 | 0.0128 | 0.0193 | 0.0314 | 1.0007 | 3000 |
| deviance | 849.4309 | 3.9556 | 844.1442 | 846.6033 | 848.7129 | 851.5262 | 858.6929 | 1.0007 | 3000 |
| DIC | pD |
|---|---|
| 857.2577 | 7.8269 |
We know that \(\beta_i\) measures the marginal impact of the \(X_i\) on the odds in favor of \(Y\). Fixed this concept, we can comment the results obtained.
\[~\] # MODEL SPECIFICATION
model
{
for (i in 1:N){
Y[i] ~ dbern(p[i])
p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
}
# Defining the prior beta parameters
beta0 ~ ddexp(0, 1.0E-4)
beta1 ~ ddexp(0, 1.0E-4)
beta2 ~ ddexp(0, 1.0E-4)
beta3 ~ ddexp(0, 1.0E-4)
beta4 ~ ddexp(0, 1.0E-4)
beta5 ~ ddexp(0, 1.0E-4)
beta6 ~ ddexp(0, 1.0E-4)
}
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.6299 | 0.7467 | -10.0184 | -9.1809 | -8.6562 | -8.0994 | -7.2196 | 1.0119 | 960 |
| beta1 | 0.1165 | 0.0317 | 0.0552 | 0.0957 | 0.1162 | 0.1370 | 0.1792 | 1.0012 | 3000 |
| beta2 | 0.0350 | 0.0033 | 0.0288 | 0.0327 | 0.0350 | 0.0371 | 0.0416 | 1.0049 | 470 |
| beta3 | -0.0061 | 0.0081 | -0.0226 | -0.0113 | -0.0061 | -0.0004 | 0.0092 | 1.0058 | 480 |
| beta4 | 0.1032 | 0.0152 | 0.0732 | 0.0931 | 0.1034 | 0.1133 | 0.1325 | 1.0111 | 590 |
| beta5 | 0.8920 | 0.2866 | 0.3391 | 0.7014 | 0.8790 | 1.0800 | 1.4671 | 1.0013 | 2600 |
| beta6 | 0.0133 | 0.0093 | -0.0044 | 0.0070 | 0.0132 | 0.0192 | 0.0320 | 1.0031 | 3000 |
| deviance | 849.3543 | 3.7506 | 844.0283 | 846.6301 | 848.7282 | 851.3682 | 858.7077 | 1.0055 | 400 |
| DIC | pD |
|---|---|
| 856.3575 | 7.0032 |
The 95%-CI for \(\beta_3\) in both models contains the zero, hence the parameter does not matter in our analysis in the relation between the covariates and the outputs. Let’s have a look to the model if we remove the variable \(x_3\) = Blood Pressure:
\[~\]
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.8640 | 0.6967 | -10.2138 | -9.3367 | -8.8709 | -8.3993 | -7.4989 | 1.0008 | 3000 |
| beta1 | 0.1158 | 0.0314 | 0.0548 | 0.0951 | 0.1150 | 0.1367 | 0.1780 | 1.0008 | 3000 |
| beta2 | 0.0348 | 0.0034 | 0.0283 | 0.0325 | 0.0348 | 0.0371 | 0.0417 | 1.0006 | 3000 |
| beta4 | 0.0995 | 0.0142 | 0.0718 | 0.0898 | 0.0999 | 0.1092 | 0.1275 | 1.0019 | 1400 |
| beta5 | 0.9061 | 0.2868 | 0.3602 | 0.7104 | 0.9000 | 1.0992 | 1.4678 | 1.0023 | 1100 |
| beta6 | 0.0113 | 0.0090 | -0.0060 | 0.0051 | 0.0112 | 0.0175 | 0.0289 | 1.0011 | 3000 |
| deviance | 848.8892 | 3.4988 | 844.1271 | 846.4120 | 848.1873 | 850.6570 | 856.9820 | 1.0036 | 830 |
| DIC | pD |
|---|---|
| 854.9993 | 6.11 |
\[~\]
| mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff | |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -8.8867 | 0.7055 | -10.3427 | -9.3491 | -8.8694 | -8.4005 | -7.5533 | 1.0098 | 220 |
| beta1 | 0.1151 | 0.0321 | 0.0497 | 0.0936 | 0.1151 | 0.1369 | 0.1783 | 1.0023 | 1100 |
| beta2 | 0.0348 | 0.0034 | 0.0280 | 0.0326 | 0.0349 | 0.0371 | 0.0414 | 1.0042 | 590 |
| beta4 | 0.0998 | 0.0153 | 0.0700 | 0.0894 | 0.0998 | 0.1104 | 0.1299 | 1.0107 | 220 |
| beta5 | 0.9119 | 0.2882 | 0.3601 | 0.7118 | 0.9126 | 1.0993 | 1.4859 | 1.0017 | 3000 |
| beta6 | 0.0116 | 0.0093 | -0.0061 | 0.0054 | 0.0114 | 0.0179 | 0.0297 | 1.0058 | 380 |
| deviance | 849.1369 | 3.6672 | 844.1308 | 846.4201 | 848.5044 | 851.0922 | 858.1590 | 1.0026 | 2000 |
| DIC | pD |
|---|---|
| 855.8589 | 6.722 |
The criterion used for selecting the best model is the Deviance Information Criterion (DIC): a comparative index used for choosing among competing models. It is a measure of goodness of fit of the model (average deviance), discounted by a penalization term,representing the number of parameters (pD):
\[ DIC = p_D+ \hat{D}_{\text{avg}}(\theta) \] where:
\[\hat{D}_{\text{avg}}(\theta) \approx \frac{1}{M} \sum_{i=1}^M -2 \log\left(f\left(y|\theta^{(j)}\right) \right) \]
The model with the smallest DIC assumes Normal Priors and excludes
the variable x3; hence, the following analysis will be
referred to such setup.
\[~\] - Traceplot: The traceplots allow to assess the convergence of the markov chain by showing the time series of the chain. All chains seem to be exploring the same region of parameter values, which is a good sign. The expected outcome is to observe a stationary behavior after a number of transitions, which translates into regular oscillation within a certain range due to the chain reaching the target distribution.
Density plot: Depicts the simulated sampling distribution of the parameters, for each chain. The yellow vertical line represents the empirical mean, while the horizontal one the 95% quantile based credible interval based on the joint results of all of the three chains.
Autocorrelation Function Plot : The Auto Correlation Function(ACF) plots visualize how much the correlation between the simulated values holds in the previous states.
If the process becomes stationary, we expect the correlation to vanish as we increase the lag \(h\). In the limit, if \(\rho=0\) the simulations are iid (which of course is not our case).
\[~\] One of the main goals of performing MCMC is to approximate quantities which are functionals of the target distribution \(\pi(\cdot)\), such as the empirical mean: \[ I = \mathbb{E}_{\pi}[\beta_{j}] \approx \hat{I_t}= \sum_{i=1}^t\frac{\beta_{i}}{t} \]
Where we assume the sample \(\beta_{j}^{(1)}, ..., \beta_{j}^{(t)}\) to be simulated from suitable Markov Chain: invariant w.r.t. \(\pi(\cdot)\). Such condition is satisfied either (i) by considering as starting distribution the stationary distribution or (ii) - under the regularity conditions make the ergodic behavior (SLL) possible - by taking only the samples following the burn-in time \(T_0\), a suitable time such that \(\beta_{j}^{(T_0-1)} \approx \pi(\cdot)\).
By looking the plots we can see that the line that quickly approaches the overall mean. The point in which this happens is precisely the burn-in time.As we can see above each parameter achieves in all of the three chains generated the same end point, so means that with different initial points in these three chains, we are strictly going to have the same estimated mean parameter.
\[~\] As criterion to establish the parameter with the larges the posterior uncertainty, we can look at the width of the quantile-based credible intervals (95%):
| LB | UB | width | |
|---|---|---|---|
| beta0 | -10.3427 | -7.5533 | 2.7894 |
| beta1 | 0.0497 | 0.1783 | 0.1285 |
| beta2 | 0.0280 | 0.0414 | 0.0135 |
| beta4 | 0.0700 | 0.1299 | 0.0599 |
| beta5 | 0.3601 | 1.4859 | 1.1258 |
| beta6 | -0.0061 | 0.0297 | 0.0359 |
| LB | UB | width | |
|---|---|---|---|
| beta0 | -10.3184 | -7.5380 | 2.7804 |
| beta1 | 0.0483 | 0.1760 | 0.1276 |
| beta2 | 0.0280 | 0.0414 | 0.0134 |
| beta4 | 0.0714 | 0.1309 | 0.0595 |
| beta5 | 0.3445 | 1.4715 | 1.1270 |
| beta6 | -0.0063 | 0.0295 | 0.0358 |
As we can notice, in both cases the parameter associated to the widest credible interval is the intercept: \(\beta_0\).
\[~\]
The most correlated couple of parameters is \(\beta_4\) and \(\beta_0\).
\[~\] The empirical mean is an unbiased estimator; therefore the MSE is equal to its variance. Differently from the Vanilla Monte Carlo case, in which the estimate are computed on iid samples, in the MCMC each realization of the sample depends on the previous one, which on turns depend on its prior and so on.. As a consequence, the variance of the empirical mean of the simulated values cannot be computed as the simulated values divided by the number of simulations, but we must take into account the covariance between each pair of simulated values. The higher the correlation, the larger will be the variance of the approximation provided by the MCMC algorithm with respect to the variance of the iid simulations proper of the MC method. This implies that the Markov Chain is more inefficient than standard iid simulation. This can be quantified by the effective sample size (ESS):
\[ t_{eff}=\frac{t}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \]
Used to normalize the variance of \(h(\beta_j)\) and get the variance of \(\hat{I}_t\):
\[ \sigma^2_{\hat{I}_t} = \mathbb{V}\text{ar}[\hat{I}_t] =\frac{\mathbb{V}\text{ar}_{\pi}(h(\beta_j))}{t_{eff}} = {\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \cdot \frac{\sigma^2}{t} \]
Another estimate of the inaccuracy of the MC samples is the Monte Carlo Standard Error (MCSE). It measures the standard deviation around the posterior mean of the samples due to the uncertainty of the MCMC algorithm.
| ESS | IFiid | MCMCerror | MCSEerror | |
|---|---|---|---|---|
| beta0 | 1000 | 0.0005 | 0.0005 | 0.0216 |
| beta1 | 1000 | 0.0000 | 0.0000 | 0.0010 |
| beta2 | 1000 | 0.0000 | 0.0000 | 0.0001 |
| beta4 | 1000 | 0.0000 | 0.0000 | 0.0004 |
| beta5 | 1000 | 0.0001 | 0.0001 | 0.0091 |
| beta6 | 1000 | 0.0000 | 0.0000 | 0.0003 |
| deviance | 1000 | 0.0122 | 0.0122 | 0.1178 |
\[~\] In the following are reported some Convergence Diagnostics, aimed at verifying the presence of converge issues, which might suggest to enlarge the number of simulations or use some other types of parametrizations. We are considering different tools, such as:
\[~\] Gelman-Rubin diagnostic evaluates the convergence by analyzing the difference between multiple Markov chains. Let:
For each parameter, we compare the between-chains and within-chain estimated variances; large differences between them indicate nonconvergence.
\[ W_T = \frac{1}{M} \sum_{m=1}^3 \left[ \frac{1}{T} \sum_{t=1}^T \left(h\left(\beta_t^{(m)}\right) -\hat{I}_T^{(m)}\right ) \right] \]
where:
Based on these measures, we compute the Potential Scale Reduction factor: \[ \hat{R_T} = \sqrt{ \frac{T-1}{T} W_T + \frac{B_T}{n}} \]
The model is healthy if \(\hat{R}_T \approx 1\) and is below some threshold (usually \(\hat{R}_T < 1.1\)).
| Point Estimate | Upper CI | |
|---|---|---|
| beta0 | 1.000 | 1.001 |
| beta1 | 1.001 | 1.005 |
| beta2 | 1.000 | 1.002 |
| beta4 | 1.001 | 1.006 |
| beta5 | 1.001 | 1.004 |
| beta6 | 1.001 | 1.004 |
| deviance | 1.003 | 1.009 |
| Multivariate PSRF |
|---|
| 1.005 |
\[~\] Geweke Diagnostics consists in an inter chain convergence check. Compares the first 20% of the chain after burnin with the half last 50% percent; If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution. The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error. The standard error is estimated from the spectral density at zero, and so takes into account any autocorrelation.
The Z-score is calculated under the assumption that the two parts of the chain are asymptotically independent.
\[ Z = \frac{\hat{\beta}_A-\hat{\beta}_B}{\frac{1}{T_A}\hat{S}^{A}_{\beta}(0)+\frac{1}{T_B}\hat{S}^{B}_{\beta}(0)} \] where \(A\),\(B\) are two windows within the Markov chain and the standard error is estimated from the spectral density at zero and so takes into account any autocorrelation. If \(|Z|<1.96\) we accept the null hypothesis.
| Chain 1 | Chain 2 | Chain 3 | |
|---|---|---|---|
| beta0 | 0.509 | 0.378 | -0.066 |
| beta1 | 0.154 | -0.097 | -1.011 |
| beta2 | -1.221 | -0.114 | -0.821 |
| beta4 | 0.163 | -0.651 | 0.417 |
| beta5 | 0.020 | 0.396 | 0.742 |
| beta6 | 0.642 | -0.546 | 0.846 |
| deviance | -0.019 | -0.547 | -0.022 |
\[~\] Uses a test statistic to verify the null hypothesis that the values sampled from the Markov Chian come from a stationary distribution. It composes of two parts:
Stationary Test:
Halfwidth Test: If the chain passes the first part of the diagnostic, then the part of the chain that was not discarded from the first part is used to test the second part.
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.711 | pass | -8.864 | 0.043 |
| beta1 | pass | 1 | 0.773 | pass | 0.117 | 0.002 |
| beta2 | pass | 1 | 0.462 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.828 | pass | 0.100 | 0.001 |
| beta5 | pass | 1 | 0.434 | pass | 0.914 | 0.018 |
| beta6 | pass | 1 | 0.288 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.160 | pass | 848.991 | 0.217 |
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.986 | pass | -8.880 | 0.040 |
| beta1 | pass | 1 | 0.880 | pass | 0.115 | 0.002 |
| beta2 | pass | 1 | 0.809 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.806 | pass | 0.100 | 0.001 |
| beta5 | pass | 1 | 0.386 | pass | 0.889 | 0.014 |
| beta6 | pass | 1 | 0.786 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.689 | pass | 848.647 | 0.217 |
| Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth | |
|---|---|---|---|---|---|---|
| beta0 | pass | 1 | 0.947 | pass | -8.848 | 0.044 |
| beta1 | pass | 1 | 0.230 | pass | 0.116 | 0.002 |
| beta2 | pass | 1 | 0.302 | pass | 0.035 | 0.000 |
| beta4 | pass | 1 | 0.718 | pass | 0.099 | 0.001 |
| beta5 | pass | 1 | 0.632 | pass | 0.915 | 0.018 |
| beta6 | pass | 1 | 0.506 | pass | 0.011 | 0.001 |
| deviance | pass | 1 | 0.965 | pass | 849.030 | 0.228 |
###Raftery & Lewis Diagnostic \[~\] This Introduces a MCMC diagnostic that estimates the number of iterations needed for a given level of precision in posterior samples.
coda::raftery.diag(coda.fit)
## [[1]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
##
## [[2]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
##
## [[3]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
As we can see above, we need N samples to achieve these performances in these different chains.
\[~\] In order to validate our model, we can create, through simulation, a synthetic dataset from a distribution whose parameters (\(\beta\)’s) are fixed and thus, known. Then, we check if it is able to recover the true population parameters by applying it to the simulated data.
| simulation parameters | fixed parameters | ratio | |
|---|---|---|---|
| beta0 | -8.787 | -8.798 | 1.001 |
| beta1 | 0.083 | 0.114 | 1.375 |
| beta2 | 0.035 | 0.035 | 0.990 |
| beta4 | 0.109 | 0.099 | 0.909 |
| beta5 | 0.873 | 0.896 | 1.026 |
| beta6 | 0.006 | 0.011 | 1.783 |
As we can notice, the approximation ratio is \(\approx 1\) for almost all the features. We can conclude that the model is able to recover the true parameter and confirm its adequacy.
\[~\] We want to test our model by evaluating its predictive performances over new observations (test data);
\[~\] Here they are reported the results of our Bayesian Logistic Regression Classifier,with the canonical cut-off level = 0.5, compared to the ones obtained by using different Machine Learning models, namely:
| Bayesian_LR | Frequentistic_LR | RandomForest | SVM | |
|---|---|---|---|---|
| Accuracy | 0.7884615 | 0.7740385 | 0.8317308 | 0.7788462 |
| Precision | 0.8155340 | 0.9029126 | 0.8476190 | 0.7961165 |
| Recall | 0.7706422 | 0.7153846 | 0.8240741 | 0.7663551 |
| F1-score | 0.7924528 | 0.7982833 | 0.8356808 | 0.7809524 |
As we can notice, Random Forest is the best performing model; nevertheless, our Bayesian Logistic Regression has an accuracy higher than both Frequentistic Logistic Regression and Support Vector Classifier, despite a lower Precision than the former. An important remark is that we are mainly interested in achieving an high recall, since we want to detect as many Diabetic people as possible to allow them to preventive care; our model overtakes both Frequentistic LR and SVM.